Abstract

The identification of compound-protein interactions plays a key role in drug development toward the discovery of 
new drug leads and new therapeutic protein targets. There is therefore a strong incentive to develop new efficient 
methods for predicting compound-protein interactions on a genome-wide scale. In this paper we develop a novel 
chemogenomic method to make a scalable prediction of compound-protein interactions from heterogeneous 
biological data using minwise hashing. The proposed method mainly consists of two steps: 1) construction of new 
compact fingerprints for compound-protein pairs by an improved minwise hashing algorithm, and 2) application of 
a sparsity-induced classifier to the compact fingerprints. We test the proposed method on its ability to make a 
large-scale prediction of compound-protein interactions from compound substructure fingerprints and protein 
domain fingerprints, and show superior performance of the proposed method compared with the previous 
chemogenomic methods in terms of prediction accuracy, computational efficiency, and interpretability of the 
predictive model. None of the previously developed methods is computationally feasible for the full dataset 
consisting of about 200 million compound-protein pairs. The proposed method is expected to be useful for 
virtual screening of a huge number of compounds against many protein targets. 



Background 

The identification of compound-protein interactions is an 
important part in the drug development toward discovery 
of new drug leads and new therapeutic protein targets. 
The completion of the human genome sequencing project 
has made it possible for us to analyze the genomic space 
of possible proteins coded in the human genome. At the 
same time, many efforts have also been devoted to the 
constitution of molecular databanks to explore the entire 
chemical space of possible compounds including synthe- 
sized molecules or natural molecules extracted from ani- 
mals, plants, or microorganisms. However, there is little 
knowledge about the interactions between compounds 
and proteins. For example, the US PubChem database 
stores more than 30 million chemical compounds, but the 
number of compounds with information on their target 
proteins is very limited [1]. Against this background, chemogenomics 
research, which investigates the relationship between the chemical 
space and the genomic space, has rapidly grown in importance [2,3]. 
A key issue in chemogenomics is the computational prediction of 
compound-protein interactions on a genome-wide scale. 

* Correspondence: yasuo.tabei@gmail.com 
PRESTO, Japan Science and Technology Agency, Kawaguchi, Saitama 332-0012, Japan 
Full list of author information is available at the end of the article 

Recently, a variety of in silico chemogenomic approaches 
have been developed to predict compound-protein inter- 
actions or drug-target interactions, assuming that similar 
compounds are likely to interact with similar proteins. 
The state of the art in the chemogenomic approach is to build the 
chemogenomic space of compound-protein pairs as the tensor product 
of the chemical space of compounds and the genomic space of proteins, 
and to analyze compound-protein pairs with machine learning 
classifiers such as the support vector machine (SVM) [4-8]. However, 
the input of the SVM in most previous works is the pairwise kernel 
similarity matrix of compound-protein pairs, which makes it difficult 
to analyze large-scale data. For example, standard implementations 
such as LIBSVM [9] and SVM^light [10] cannot be applied, because they 
require prohibitive computational time and the kernel matrix for 
compound-protein pairs is too large to construct explicitly in 
memory. None of the previous chemogenomic methods is suitable for 
scalable screening of millions or billions of compound-protein pairs. 

A fingerprint is a powerful way to efficiently summarize information 
about various bio-molecules (e.g., compounds, proteins) by encoding 
their molecular structures or physicochemical properties into 
finite-dimensional binary vectors. The fingerprint representation has 
a long history in chemoinformatics, and many 1D, 2D or 3D descriptors 
for molecules have been proposed [11] and adopted in many molecular 
databases such as PubChem [1] and ChemDB [12]. Fingerprints can be 
used for exploring the chemical space based on their Euclidean 
distance or Tanimoto coefficients, and can also be used as inputs of 
various machine learning classifiers to predict various biological 
activities of compounds [13]. The fingerprint representation is 
applicable to proteins as well [14,15]. 

In this study we consider representing compound-pro- 
tein pairs by the fingerprints to use them as inputs of lin- 
ear SVM, because the linear SVM provides us with 
interpretable predictive models and works well for super- 
high dimensional data [16]. A straightforward way is to 
represent each compound-protein pair by taking the ten- 
sor product of the compound fingerprint and the protein 
fingerprint, which enables biological interpretation of che- 
mogenomic features (functional associations between 
compound substructures and protein domains) behind 
interacting compound-protein pairs [8]. However, the 
resulting fingerprint is sparse and super-high dimensional. 
Even worse, the total number of fingerprints is the product 
of the number of compounds and the number of proteins, 
so it is difficult to train classical linear SVM for extremely 
large-scale data. Although optimization techniques of lin- 
ear SVM have recently advanced [17-20], they are not 
enough to analyze a huge number of compound-protein 
pairs in practice. 

In this paper we develop a novel chemogenomic 
method to make a scalable prediction of compound- 
protein interactions from heterogeneous biological data 
using minwise hashing, which is applicable for virtual 
screening of a huge number of compounds against many 
human proteins. The proposed method mainly consists 
of two steps: 1) construction of new compact fingerprints 
for compound-protein pairs by an improved minwise 
hashing algorithm, and 2) application of the linear SVM 
to the compact fingerprints. A unique feature of the pro- 
posed method is that the linear SVM with the compact 
fingerprints generated by the minwise hashing is able to 
simulate the nonlinear property of the kernel SVM. We 
test the proposed method on its ability to make a large-scale 
prediction of compound-protein interactions from compound 
substructure fingerprints and protein domain fingerprints, and show 
superior performance of the proposed method compared with the 
previous chemogenomic methods in terms of prediction accuracy, 
computational efficiency, and interpretability of the predictive 
model. None of the previously developed methods is computationally 
feasible for the full dataset consisting of about 200 million 
compound-protein pairs. 

Materials 

Compound-protein interactions involving human pro- 
teins were obtained from the STITCH database [21]. 
Compounds are small molecules and proteins belong to 
many different classes such as enzymes, transporters, 
ion channels, and receptors. The dataset consists of 
300,202 known compound-protein interactions out of 
216,121,626 possible compound-protein pairs, involving 
35,366 compounds and 6,111 proteins. Note that dupli- 
cated compounds were removed. The set of known 
interactions is used as gold standard data. 

Chemical structures of compounds were encoded by a 
chemical fingerprint with 881 chemical substructures 
defined in the PubChem database [1]. Each compound 
was represented by a substructure fingerprint (binary vec- 
tor) whose elements encode for the presence or absence 
of each of the 881 PubChem substructures by 1 or 0, 
respectively. 

Genomic information about proteins was obtained from 
the UniProt database [22], and the associated protein 
domains were obtained from the PFAM database [23]. 
Proteins in our dataset were associated with 4,137 PFAM 
domains. Each protein was represented by a domain fin- 
gerprint (binary vector) whose elements encode for the 
presence or absence of each of the retained 4,137 PFAM 
domains by 1 or 0, respectively. 

Methods 

We treat the in silico chemogenomics problem as the following 
machine learning problem: given a set of $n$ compound-protein pairs 
$(C_1, P_1), \ldots, (C_n, P_n)$, estimate a function $f(C, P)$ that 
predicts whether a compound $C$ binds to a protein $P$. In addition, 
we attempt to estimate an interpretable function $f$ in order to 
extract informative features. Since our dataset consists of about 
216 million compound-protein pairs, we propose an efficient and 
general approach to solve these problems. 

Model 

Linear models such as linear support vector machines (linear SVM) 
and logistic regression are a feasible tool for large-scale 
classification and regression tasks, and they provide comprehensible 
models for these tasks. Generally, linear models represent each 
example $E$ as a real-valued feature vector $\Phi(E)$ and then 
estimate a linear function $f(E) = w^T \Phi(E)$ whose sign is used to 
predict whether the example $E$ is positive or negative. Note that 
fingerprints are used as feature vectors in this study. The weight 
vector $w$, which has the same dimension as $\Phi(E)$, is estimated 
based on its ability to correctly predict the classes of examples in 
the training set. Since each element of the weight vector $w$ 
corresponds to an element of the fingerprint $\Phi(E)$, we can 
interpret salient features by sorting the elements of $\Phi(E)$ 
according to the values of the corresponding elements of $w$. 

In this study each compound-protein pair corresponds to an example. 
Thus, it is necessary to represent each compound-protein pair 
$(C, P)$ as a single fingerprint $\Phi(C, P)$ and then estimate a 
function $f(C, P) = w^T \Phi(C, P)$ whose sign is used to predict 
whether a compound $C$ interacts with a protein $P$ or not. As in the 
previous case, we can extract effective features in $\Phi(C, P)$ for 
compound-protein interaction predictions. 

Fingerprint representation of compound-protein pairs 

A fingerprint representation of compound-protein pairs 
has a large impact on not only classification ability of lin- 
ear models but also interpretability of features. To meet 
both demands, we represent each compound-protein pair 
by a fingerprint using the compound fingerprint and the 
protein fingerprint. 

The fingerprint of a compound $C$ is represented by a 
$D$-dimensional binary vector $\Phi(C) = (c_1, c_2, \ldots, c_D)^T$, 
where $c_i \in \{0, 1\}$, $i = 1, \ldots, D$. The fingerprint of a 
protein $P$ is represented by a $D'$-dimensional binary vector as 
well: $\Phi(P) = (p_1, p_2, \ldots, p_{D'})^T$, where 
$p_j \in \{0, 1\}$, $j = 1, \ldots, D'$. We define the fingerprint of 
each compound-protein pair as the tensor product of $\Phi(C)$ and 
$\Phi(P)$ as follows: 

$$\Phi(C, P) = \Phi(C) \otimes \Phi(P) = (c_1 p_1, \ldots, c_1 p_{D'}, \ldots, c_D p_1, \ldots, c_D p_{D'})^T$$

$\Phi(C, P)$ consists of all possible products of elements in the two 
fingerprints $\Phi(C)$ and $\Phi(P)$, so the fingerprint is a 
$D \times D'$ dimensional binary vector. The dimensions of $\Phi(C)$, 
$\Phi(P)$, and $\Phi(C, P)$ in this study are $D = 881$, $D' = 4{,}137$, 
and $DD' = 3{,}644{,}697$, respectively. 
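
The tensor-product construction can be illustrated with a minimal sketch in plain Python. The function names and the toy fingerprints (with D = 3 and D' = 4) are illustrative assumptions and are not part of the original study; the second helper shows one possible way to obtain the 1-based positions of the nonzero elements without materializing the full vector, which may matter when D x D' is in the millions.

```python
def tensor_product_fingerprint(compound_fp, protein_fp):
    """Return (c_1 p_1, ..., c_1 p_D', ..., c_D p_1, ..., c_D p_D') as a list."""
    return [c * p for c in compound_fp for p in protein_fp]


def nonzero_positions(compound_fp, protein_fp):
    """1-based positions of the ones in the tensor product, computed sparsely."""
    d_prime = len(protein_fp)
    return {
        i * d_prime + j + 1
        for i, c in enumerate(compound_fp) if c
        for j, p in enumerate(protein_fp) if p
    }


# Hypothetical toy fingerprints, not taken from the dataset.
compound = [1, 0, 1]        # D = 3
protein = [0, 1, 1, 0]      # D' = 4

pair = tensor_product_fingerprint(compound, protein)
positions = nonzero_positions(compound, protein)
print(pair)
print(sorted(positions))
print(881 * 4137)           # dimension of the pair fingerprint in this study
```

Only the positions of the ones need to be stored for such sparse vectors, which is the representation that the following sections build on.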

Minwise hashing 

We propose to use minwise hashing for analyzing fingerprints 
efficiently. In this section, we briefly review minwise hashing [24]. 
A key observation is that any fingerprint can be represented uniquely 
by a set. Each fingerprint $\Phi(C, P)$ is represented by a set 
$S \subseteq \Omega = \{1, 2, \ldots, D \times D'\}$. Given two sets 
$S_i$ and $S_j$, the Jaccard similarity $J(S_i, S_j)$ of $S_i$ and 
$S_j$ is defined as 

$$J(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}.$$

Minwise hashing is a random projection of sets such that the expected 
Hamming distance between the resulting symbol strings is proportional 
to the Jaccard distance, i.e., one minus the Jaccard similarity [24]. 
We pick $\ell$ random permutations $\pi_k$, $k = 1, \ldots, \ell$, 
each of which maps $[1, M]$ to $[1, M]$. Let 
$T_i = t_{i1}, \ldots, t_{i\ell}$ be the resultant string projected 
from $S_i$. The projection is defined as the minimum element of the 
random permutation of the given set, 

$$t_{ik} = \min\{\pi_k(S_i)\}.$$

For example, if $\pi_k$ is defined as 

$$(1, 2, 3, 4, 5, 6, 7, 8) \rightarrow (3, 8, 7, 1, 2, 6, 4, 5),$$

$S_i = \{1, 4, 6, 7\}$ is transformed to $\pi_k(S_i) = \{3, 1, 6, 4\}$, 
and the final product is $t_{ik} = 1$. The collision probability, 
which is the probability that two sets $S_i$ and $S_j$ are projected 
to the same element ($t_{ik} = t_{jk}$), is described as 

$$\Pr[t_{ik} = t_{jk}] = J(S_i, S_j).$$

Therefore, the expected Hamming distance between $T_i$ and $T_j$ is 
identical to $\ell(1 - J(S_i, S_j))$. 
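
As a rough illustration, the following sketch reproduces the worked example above and then estimates a Jaccard similarity from the Hamming distance between two minwise-hashed strings. The helper names, the second set, the seed, and the choice of 2,000 permutations are all assumptions made for this example only.

```python
import random


def min_hash(perm, s):
    """Minimum permuted value of set s; perm maps each element of 1..M to 1..M."""
    return min(perm[x] for x in s)


def random_permutation(m, rng):
    """Draw a uniformly random permutation of 1..m as a dict."""
    values = list(range(1, m + 1))
    rng.shuffle(values)
    return dict(zip(range(1, m + 1), values))


def jaccard(a, b):
    return len(a & b) / len(a | b)


# Worked example from the text: 1->3, 2->8, 3->7, 4->1, 5->2, 6->6, 7->4, 8->5.
pi_k = dict(zip(range(1, 9), (3, 8, 7, 1, 2, 6, 4, 5)))
s_i = {1, 4, 6, 7}
t_ik = min_hash(pi_k, s_i)

# Estimating the Jaccard similarity of two sets from many permutations.
rng = random.Random(0)
perms = [random_permutation(8, rng) for _ in range(2000)]
s_j = {1, 4, 5, 8}                      # hypothetical second set
t_i = [min_hash(p, s_i) for p in perms]
t_j = [min_hash(p, s_j) for p in perms]
hamming = sum(a != b for a, b in zip(t_i, t_j))
estimate = 1.0 - hamming / len(perms)

print(t_ik)
print(jaccard(s_i, s_j), estimate)
```

The estimate fluctuates around the exact Jaccard similarity, and the fluctuation shrinks as the number of permutations grows.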

Saving memory by additional hashing 

The common practice in minwise hashing is to store each hashed value 
using 64 bits [24]. The storage (and computational) cost is 
prohibitive in large-scale applications. To overcome this problem, 
Li et al. proposed b-bit minwise hashing [25,26], which keeps only 
the lowest $b$ bits of each hashed value. However, a theoretical 
analysis of the collision probability is complicated. 

Here we introduce a simple yet effective method for which a 
theoretical estimate of the collision probability can easily be 
derived. In our method, the hashed values are further hashed randomly 
to a set $\{1, \ldots, N\}$, where $N \ll M$. This projection is 
defined as follows: 

$$s_{ik} = h(t_{ik}),$$

where $h : \{1, \ldots, M\} \rightarrow \{1, \ldots, N\}$ is a random 
hash function. If $t_{ik}$ and $t_{jk}$ are identical, $s_{ik}$ and 
$s_{jk}$ always collide. If not, they collide with probability $1/N$. 
Thus, the collision probability is obtained as follows: 

$$\Pr[s_{ik} = s_{jk}] = J(S_i, S_j) + \frac{1}{N}\,(1 - J(S_i, S_j)) = 1 - \left(1 - \frac{1}{N}\right)(1 - J(S_i, S_j)). \quad (1)$$

Figure 1 shows the collision probability as a function of the size of 
the additional hashing value $N$, for four different Jaccard 
similarities: 0.1, 0.3, 0.5 and 0.7. It is observed that the collision 
probabilities remain essentially unchanged once $N$ exceeds a certain 
power of two. Thus, a small size of hashing value can be chosen 
without loss of accuracy. 
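
The behaviour described for Figure 1 can be checked numerically. The sketch below evaluates the collision probability that follows from the argument above (identical minwise values always collide, distinct ones collide with probability 1/N); the function name and the particular grid of N values are assumptions for illustration and are not the values plotted in the paper.

```python
def collision_probability(jaccard, n):
    """Pr[s_ik = s_jk] = 1 - (1 - 1/N)(1 - J) for additional hash size N."""
    return 1.0 - (1.0 - 1.0 / n) * (1.0 - jaccard)


# Hypothetical grid of hash sizes; each row approaches J as N grows.
for j in (0.1, 0.3, 0.5, 0.7):
    row = [round(collision_probability(j, 2 ** e), 4) for e in (1, 4, 8, 12)]
    print(j, row)
```

Because the excess over the Jaccard similarity is (1 - J)/N, it shrinks quickly as N grows, which is consistent with the plateau reported for Figure 1.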

Building compact fingerprints by minwise hashing 

Learning linear models with large-scale high-dimensional data is a 
difficult problem in terms of computational cost. Here we propose a 
method to represent the original fingerprint of a compound-protein 
pair by a new fingerprint whose size is smaller than that of the 
original fingerprint. 

Figure 1 Collision probabilities for varying sizes of the additional hashing value N. 

A crucial observation is that any fingerprint can be represented 
uniquely as a set, and can also be converted uniquely into a string. 
First, we convert the original fingerprint of each compound-protein 
pair into a string by applying minwise hashing and additional 
hashing. Next, we expand the hashing values that make up the string 
into a new binary vector whose dimension is much smaller than that of 
the original fingerprint. 

Let $S(C, P)$ be a set representation of $\Phi(C, P)$ where $i$ is 
contained in $S(C, P)$ iff the $i$-th element of $\Phi(C, P)$ is 1. 
We apply minwise hashing $\pi_k$ ($k = 1, \ldots, \ell$) to $S(C, P)$ 
to generate a string $T(C, P) = t_1, t_2, \ldots, t_\ell$, where each 
element $t_k$ takes a value ranging from 1 to $M$. We additionally 
hash each element $t_k$ to a new small value ranging from 1 to $N$ 
($N \ll M$) by applying the additional hash $h$, and generate a new 
string $T'(C, P) = t'_1, t'_2, \ldots, t'_\ell$. Each value $t'_k$ in 
the string $T'(C, P)$ is expanded to an $N$-dimensional binary vector 
$f_k$ where the $t'_k$-th element is 1 and the others are 0. Finally, 
we concatenate $f_1, \ldots, f_\ell$ into a single vector, and obtain 
an $N\ell$-dimensional binary vector $F(C, P) = (f_1, \ldots, f_\ell)$. 
The newly obtained $F(C, P)$ is referred to as the "compact 
fingerprint". Figure 2 shows an illustration of the proposed 
procedure. 
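
The construction can be sketched in a few lines of Python. This is a minimal illustration under assumed toy parameters (M = 50, N = 4, four permutations, a fixed random seed) with hypothetical helper names; it is not the authors' implementation, and the positions inside each one-hot block are counted from the left.

```python
import random


def expand(hashed_values, n):
    """One-hot encode each value in 1..n and concatenate the blocks."""
    fingerprint = []
    for value in hashed_values:
        block = [0] * n
        block[value - 1] = 1
        fingerprint.extend(block)
    return fingerprint


def compact_fingerprint(s, perms, h, n):
    """Minwise hashing, additional hashing, then expansion of the set s."""
    t = [min(perm[x] for x in s) for perm in perms]   # string T, values in 1..M
    t_small = [h[value] for value in t]               # string T', values in 1..N
    return expand(t_small, n)


# Hypothetical toy setting, chosen only for illustration.
rng = random.Random(0)
M, N, ELL = 50, 4, 4
perms = []
for _ in range(ELL):
    values = list(range(1, M + 1))
    rng.shuffle(values)
    perms.append(dict(zip(range(1, M + 1), values)))
h = {t: rng.randint(1, N) for t in range(1, M + 1)}   # random additional hash

F = compact_fingerprint({3, 6, 10, 15, 50}, perms, h, N)
print(F)
```

Each block of N entries contains exactly one 1, so the compact fingerprint has N x l dimensions but only l nonzero elements, regardless of how many ones the original fingerprint had.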

Linear support vector machines (Linear SVM) 

We use linear SVM as a classifier. The predictive model is typically 
learned by minimizing an objective function with regularization. The 
most common regularization is L2-regularization, which keeps most 
elements of the weight vector non-zero, so the predictive model is 
difficult to interpret because of its many non-zero weights. 
L2-regularized linear SVM is referred to as L2SVM. Another 
regularization is L1-regularization, which drives most elements of 
the weight vector to zero, so L1-regularization is popular for the 
high interpretability that the induced sparsity provides. 
L1-regularized linear SVM is referred to as L1SVM. 

Figure 2 Construction of a compact fingerprint. The panels show, in 
order: (1) the $D$- and $D'$-dimensional fingerprints $\Phi(C)$ and 
$\Phi(P)$ of the compound and the protein; (2) the $D \times D'$ 
dimensional fingerprint $\Phi(C, P)$ obtained as their tensor product; 
(3) its set representation, $S(C, P) = \{3, 6, 10, 15, 50\}$; (4) the 
string of length 4 obtained by minwise hashing with $\pi_k$ 
($k = 1, 2, 3, 4$), $T(C, P) = 8, 10, 2, 15$; (5) the string obtained 
by additional hashing $h$ with $N = 4$, $T'(C, P) = 1, 3, 3, 4$; and 
(6) the 16-dimensional compact feature vector 
$F(C, P) = (0,0,0,1 \,|\, 0,1,0,0 \,|\, 0,1,0,0 \,|\, 1,0,0,0)$. The 
vertical bars in $F(C, P)$ are inserted for readability. Each range 
delimited by the vertical bars in $F(C, P)$ contains the elements 
expanded from one hashing value. 

Given a training set of compound-protein pairs and labels 
$\{F(C_i, P_i), y_i\}_{i=1}^{n}$, $y_i \in \{+1, -1\}$, linear SVM is 
formulated as the following unconstrained optimization problem: 

$$\min_{w} \sum_{i=1}^{n} \max\{1 - y_i w^T F(C_i, P_i), 0\}. \quad (2)$$

To prevent overfitting, the weight vector is optimized with 
L1-regularization and L2-regularization as follows: 

$$\min_{w} \|w\|_1 + C \sum_{i=1}^{n} \max\{1 - y_i w^T F(C_i, P_i), 0\} \quad (3)$$

and 

$$\min_{w} \|w\|_2 + C \sum_{i=1}^{n} \max\{1 - y_i w^T F(C_i, P_i), 0\}, \quad (4)$$

where $\|\cdot\|_1$ and $\|\cdot\|_2$ are the L1 and L2 norms, and $C$ 
is a hyper-parameter. Recently, optimization algorithms for linear 
SVM have rapidly advanced. In this study, we use the efficient 
optimization algorithms implemented in LIBLINEAR [18]. 

The software is available from http://www.csie.ntu.edu.tw/~cjlin/liblinear/ 
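
To make the objectives concrete, the following sketch evaluates the hinge loss of Equation (2) and the two regularized objectives of Equations (3) and (4) as they are stated in this section. It only evaluates the objective values for a given weight vector and does not perform any optimization; the function names, the toy examples, and the value of C are assumptions for illustration, and solver libraries may use slightly different but related formulations.

```python
import math


def hinge_loss(w, examples):
    """Sum over examples of max(1 - y * w.x, 0)."""
    total = 0.0
    for x, y in examples:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        total += max(1.0 - margin, 0.0)
    return total


def l1_objective(w, examples, c):
    """L1 norm of w plus C times the hinge loss (form of Equation 3)."""
    return sum(abs(v) for v in w) + c * hinge_loss(w, examples)


def l2_objective(w, examples, c):
    """L2 norm of w plus C times the hinge loss (form of Equation 4)."""
    return math.sqrt(sum(v * v for v in w)) + c * hinge_loss(w, examples)


# Hypothetical two-dimensional toy data: (feature vector, label).
examples = [([1, 0], +1), ([0, 1], -1)]
w = [0.5, -2.0]
print(hinge_loss(w, examples))
print(l1_objective(w, examples, c=2.0))
print(l2_objective(w, examples, c=2.0))
```

In this toy case the first example lies inside the margin and contributes to the loss, while the second is classified with a margin larger than one and contributes nothing.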

In our method, we propose to use the compact fingerprint $F(C, P)$ 
instead of the original fingerprint $\Phi(C, P)$ as an input for 
L1SVM and L2SVM. L1SVM and L2SVM with the compact fingerprints 
$F(C, P)$ are referred to as Minwise Hashing-based L1SVM (MH-L1SVM) 
and Minwise Hashing-based L2SVM (MH-L2SVM), respectively. In 
contrast, L1SVM and L2SVM with the original fingerprints 
$\Phi(C, P)$ are referred to simply as L1SVM and L2SVM, respectively, 
which correspond to previous methods [8]. 

In most previous works the kernel SVM method was used, but the input 
of kernel SVM is the kernel similarity matrix for compound-protein 
pairs [5,6], which makes it difficult to apply the kernel SVM to 
large-scale interaction prediction. This is because the time 
complexity of the quadratic programming problem for kernel SVM is 
$O(n_c^3 \times n_p^3)$, where $n_c$ is the number of compounds and 
$n_p$ is the number of proteins, and the space complexity is 
$O(n_c^2 \times n_p^2)$ just for storing the kernel matrix. Moreover, 
kernel SVM does not offer an interpretable predictive model, because 
it cannot extract features. 
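
A back-of-the-envelope calculation illustrates why storing the kernel matrix is out of reach for the dataset described in Materials. The sketch assumes that every kernel entry is stored as an 8-byte floating-point number and that the full (non-symmetric-packed) matrix over all pairs is kept; both are assumptions for illustration only.

```python
n_compounds = 35_366
n_proteins = 6_111
n_pairs = n_compounds * n_proteins          # all possible compound-protein pairs

bytes_per_entry = 8                         # assumption: double precision
kernel_entries = n_pairs ** 2               # one entry per pair of pairs
kernel_bytes = kernel_entries * bytes_per_entry

print(n_pairs)
print(f"{kernel_bytes:.3e} bytes")
```

Under these assumptions the kernel matrix would need on the order of 10^17 bytes, far beyond the memory of any single machine, whereas a linear model only needs the fingerprints themselves.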

Relation to kernel SVM 

In this section, we describe a theoretical foundation for using 
linear SVM with compact fingerprints and discuss the relation to 
kernel SVM [5,6]. A kernel matrix is an $n \times n$ matrix $K$ 
satisfying $\sum_{i,j} c_i c_j K_{ij} \geq 0$ for all real vectors 
$c$. Such a property is called positive definite (PD), which is 
necessary to effectively train an SVM classifier with a kernel 
matrix. A matrix $A$ is PD if it can be written as an inner product 
of matrices $B^T B$. 

Our linear SVM with compact fingerprints simulates 
non-linear SVMs with the Jaccard similarity matrix for 
the following reasons. 

1. Each element of the pairwise kernel matrix of compound-protein 
pairs is defined as the number of common elements between two sets 
$S(C, P)$ and $S(C', P')$, i.e., $|S(C, P) \cap S(C', P')|$. The 
pairwise kernel matrix is PD. Jaccard similarity is a pairwise 
kernel normalized by the cardinality of the union of the two sets 
$S(C, P)$ and $S(C', P')$, i.e., $|S(C, P) \cup S(C', P')|$. The 
Jaccard similarity matrix of compound-protein pairs, where each 
element is the Jaccard similarity of two sets $S(C, P)$ and 
$S(C', P')$, is also PD. 

2. Let the minwise hashing matrix of compound-protein pairs be a 
matrix whose elements are defined as the inner product of two 
compact fingerprints $F(C, P)$ and $F(C', P')$. The minwise hashing 
matrix is PD. 

3. The $(i, j)$-element of the Jaccard similarity matrix correlates 
with the $(i, j)$-element of the minwise hashing matrix. 

4. While Jaccard similarity is a non-linear function, the inner 
product is a linear function. 

The third reason holds because the collision probability, which is 
the probability that the minwise hashing and additional hashing 
values for two sets $S(C, P)$ and $S(C', P')$ are the same, is 
positively correlated with the Jaccard similarity 
$J(S(C, P), S(C', P'))$ (Equation 1). 
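
The link between the inner product and the collision probability can be seen on a small example: the inner product of two compact fingerprints counts the positions at which the two hashed strings carry the same value. The sketch below uses two hypothetical hashed strings with N = 4; the helper names are assumptions, and positions inside each block are counted from the left.

```python
def expand(hashed_values, n):
    """One-hot encode each value in 1..n and concatenate the blocks."""
    fingerprint = []
    for value in hashed_values:
        block = [0] * n
        block[value - 1] = 1
        fingerprint.extend(block)
    return fingerprint


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical hashed strings T' for two compound-protein pairs.
s_first = [1, 3, 3, 4]
s_second = [1, 2, 3, 1]

f_first = expand(s_first, 4)
f_second = expand(s_second, 4)

collisions = sum(a == b for a, b in zip(s_first, s_second))
print(inner_product(f_first, f_second), collisions)
```

Dividing the inner product by the string length gives the empirical collision rate, whose expectation is the collision probability of Equation 1 and therefore increases with the Jaccard similarity.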

Feature extraction for biological interpretation 

Extracting informative features in the original fingerprint for 
predicting compound-protein interactions is also an important task. 
Each value of the weight vector in a linear model corresponds to the 
importance of the corresponding feature of the original fingerprint 
in the classification task. In our method, however, we apply minwise 
hashing and additional hashing to the original fingerprint, and 
build the compact fingerprint to efficiently train a linear SVM 
classifier. Thus, it is not trivial to extract features in the 
original fingerprint in our framework. 

We propose to keep the inverse mappings $\pi_k^{-1}$ and $h^{-1}$ for 
the permutation $\pi_k$ and the additional hashing $h$, and to apply 
$h^{-1}$ and $\pi_k^{-1}$ to each element in the compact fingerprint 
in order to recover the weight vector for the original fingerprint. 
Let $\pi_k^{-1} : [1, M] \rightarrow [1, M]$ ($k = 1, \ldots, \ell$) 
be the inverse mapping for the permutation 
$\pi_k : [1, M] \rightarrow [1, M]$. Let 
$h^{-1} : [1, N] \rightarrow [1, M]$ be the inverse mapping for the 
additional hashing $h : [1, M] \rightarrow [1, N]$. Note that 
$h^{-1}$ is, basically, a one-to-many mapping because $N \ll M$. 

First, we apply the inverse mapping $h^{-1}$ to each element in the 
compact fingerprint to recover the values hashed by the additional 
hashing $h$. Since $h^{-1}$ is a one-to-many mapping, several values 
are recovered. Then, the inverse mapping $\pi_k^{-1}$ is applied to 
each value in order to recover an element in the original 
fingerprint. Finally, we compute the average of the weights learned 
by the linear SVM, which provides the recovered weight vector for 
the original fingerprint. Figure 3 shows an illustration of the 
proposed procedure. 
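
The recovery procedure can be sketched as follows. This is one possible reading of the description above, under the assumption that the weights reaching an original element through the different permutations are averaged with equal weight; the function names and the toy setting (M = 4, N = 2, two permutations) are hypothetical and do not reproduce Figure 3.

```python
from collections import defaultdict


def invert_permutation(perm):
    """pi^{-1}: maps a permuted value back to the original element."""
    return {value: element for element, value in perm.items()}


def invert_hash(h):
    """h^{-1}: one-to-many mapping from a hashed value to its preimages."""
    inverse = defaultdict(list)
    for t, s in h.items():
        inverse[s].append(t)
    return inverse


def recover_weights(w_compact, perms, h, n):
    """Average compact-fingerprint weights back onto the M original elements."""
    m = len(h)
    h_inv = invert_hash(h)
    sums = [0.0] * (m + 1)
    counts = [0] * (m + 1)
    for k, perm in enumerate(perms):
        perm_inv = invert_permutation(perm)
        for s in range(1, n + 1):
            weight = w_compact[k * n + (s - 1)]   # weight of value s in block k
            for t in h_inv.get(s, []):            # values hashed to s
                j = perm_inv[t]                   # original element
                sums[j] += weight
                counts[j] += 1
    return [sums[j] / counts[j] for j in range(1, m + 1)]


# Hypothetical toy setting.
h = {1: 1, 2: 2, 3: 1, 4: 2}
perms = [{1: 1, 2: 2, 3: 3, 4: 4}, {1: 2, 2: 1, 3: 4, 4: 3}]
w_compact = [1.0, -1.0, 3.0, 5.0]
recovered = recover_weights(w_compact, perms, h, n=2)
print(recovered)
```

Every original element is reached exactly once per permutation, so each recovered weight is an average over as many values as there are permutations.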

Results 

Performance evaluation 

We tested MH-L1SVM and MH-L2SVM (the newly proposed methods) on 
their ability to predict compound-protein interactions from compound 
substructure fingerprints and protein domain fingerprints, and 
compared their performance with L1SVM and L2SVM (previous methods 
[8]) in terms of prediction accuracy and computational cost. Note 
that the kernel SVM (the state of the art [4-7]) was not 
computationally feasible for our large data. Our full dataset is too 
large (it consists of about 216 million compound-protein pairs), so 
we used a subset of the full data for efficient evaluation of the 
four different methods. In the sub-dataset, the numbers of positive 
and negative examples were balanced, i.e., 300,202 each and 600,404 
in total. We performed two types of 5-fold cross-validation: 
pair-wise cross-validation and block-wise cross-validation. 

Figure 3 Recovery of a weight vector. The panels show a toy example 
with 8 original dimensions, 3 additional hashing values and 
$\ell = 2$ permutations: (i) the values hashed by $\pi_1$, $\pi_2$ and 
$h$, and the 6 (= 3 x 2) dimension weight vector; (ii) inversion of 
the additional hashing, $h^{-1}(1) = (3, 6)$, $h^{-1}(2) = (1, 4, 7)$, 
$h^{-1}(3) = (2, 5, 8)$; (iii) inversion of the permutations, 
$\pi_1^{-1}(3) = 8$, $\pi_1^{-1}(6) = 7$, $\pi_1^{-1}(1) = 5$, 
$\pi_1^{-1}(4) = 2$, $\pi_1^{-1}(7) = 4$, $\pi_1^{-1}(2) = 1$, 
$\pi_1^{-1}(5) = 6$, $\pi_1^{-1}(8) = 3$ and $\pi_2^{-1}(3) = 6$, 
$\pi_2^{-1}(6) = 5$, $\pi_2^{-1}(1) = 7$, $\pi_2^{-1}(4) = 4$, 
$\pi_2^{-1}(7) = 8$, $\pi_2^{-1}(2) = 3$, $\pi_2^{-1}(5) = 2$, 
$\pi_2^{-1}(8) = 1$; (iv) computation of the averages of the weights 
to recover the weight vector for the original fingerprint. 

Tabei and Yamanishi BMC Systems Biology 2013, 7(Suppl 6):S3 
http://www.biomedcentral.com/1752-0509/7/S6/S3 

In the pair-wise cross-validation we perform the following 
procedure: 1) We randomly split compound-protein pairs in the gold 
standard set into five subsets of roughly equal sizes, and take each 
subset in turn as a test set. 2) We train a predictive model on the 
remaining four subsets. 3) We compute the prediction scores for 
compound-protein pairs in the test set. 4) Finally, we evaluate the 
prediction accuracy over the five folds. The pair-wise 
cross-validation assumes the situation where we want to detect 
missing interactions between known ligand compounds and known target 
proteins with information about interaction partners. 

In the block-wise cross-validation we perform the following 
procedure: 1) We randomly split compounds and proteins in the gold 
standard set into five compound subsets and five protein subsets, 
and take each compound subset and each protein subset in turn as 
test sets. 2) We train a predictive model on compound-protein pairs 
in the remaining four compound subsets and four protein subsets. 
3) We compute the prediction scores for compound-protein pairs 
involving the test compound set and the test protein set. 4) Finally, 
we evaluate the prediction accuracy over the five folds. The 
block-wise cross-validation assumes the situation where we want to 
detect new interactions for newly arriving ligand candidate 
compounds and target candidate proteins with no information about 
interaction partners. 

In both cases, we evaluated the performance by the area under the 
ROC curve (AUC) and the execution time. The cross-validations were 
performed by varying the hyper-parameter $C$ over a grid of powers 
of ten, and $C$ was chosen as the value achieving the best AUC score. 
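
The difference between the two splitting schemes can be sketched as follows. The helper names, the seed, and the toy grid of 10 x 10 pairs are assumptions; in particular, pairing the f-th compound subset with the f-th protein subset to form the f-th test block is one possible reading of the block-wise procedure, not a detail confirmed by the text.

```python
import random


def pairwise_folds(pairs, k=5, seed=0):
    """Shuffle the pairs and deal them into k folds of roughly equal size."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def assign_folds(items, k, rng):
    """Randomly assign each item (compound or protein) to one of k subsets."""
    ordered = sorted(items)
    rng.shuffle(ordered)
    return {item: i % k for i, item in enumerate(ordered)}


def blockwise_split(pairs, fold, k=5, seed=0):
    """Train on pairs avoiding the held-out compounds AND proteins; test on
    pairs made only of held-out compounds and held-out proteins."""
    rng = random.Random(seed)
    comp_fold = assign_folds({c for c, _ in pairs}, k, rng)
    prot_fold = assign_folds({p for _, p in pairs}, k, rng)
    train = [(c, p) for c, p in pairs
             if comp_fold[c] != fold and prot_fold[p] != fold]
    test = [(c, p) for c, p in pairs
            if comp_fold[c] == fold and prot_fold[p] == fold]
    return train, test


# Hypothetical toy data: 10 compounds x 10 proteins, identified by integers.
pairs = [(c, p) for c in range(10) for p in range(10)]
folds = pairwise_folds(pairs)
train, test = blockwise_split(pairs, fold=0)
print([len(f) for f in folds], len(train), len(test))
```

In the block-wise split no compound and no protein of the test block appears in the training pairs, which mirrors the scenario of newly arriving compounds and proteins, whereas in the pair-wise split the same compounds and proteins typically occur on both sides.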

We investigated the effects of the length of strings $\ell$ and the 
size of hashing values $N$ in the minwise hashing process of 
MH-L1SVM and MH-L2SVM on the performance. We tried five different 
lengths of string, $\ell = 5, 10, 15, 30, 50$. The size of additional 
hashing values $N$ was varied over a range of powers of two. Figures 
4 and 5 show the AUC scores for MH-L1SVM and MH-L2SVM in the 
pair-wise cross-validation. It was observed that the AUC scores 
reached the maximum with the length of string $\ell = 10$ and the 
best-performing size of additional hashing value $N$, and the AUC 
score was comparable to that for the original fingerprint. 

Figures 6 and 7 show the execution time for performing the minwise 
hashing and for learning SVM classifiers, where the length of string 
$\ell$ is varied from 5 to 50 and the size of additional hashing 
value $N$ is fixed. The AUC scores of MH-L1SVM and MH-L2SVM with the 
length of string $\ell = 10$ and the best-performing size of 
additional hashing value $N$ were comparable to those of L1SVM and 
L2SVM. In addition, MH-L1SVM and MH-L2SVM achieved a speedup 
compared with L1SVM and L2SVM. 

The same trends as in the pair-wise cross-validation were observed 
in the block-wise cross-validation. The corresponding results for 
the block-wise cross-validation are shown in Figures 8, 9, 10 and 11. 
The AUC scores in the block-wise cross-validation were lower than 
those in the pair-wise cross-validation, which implies that 
predicting unknown interactions for new compounds and proteins 
outside of the learning set is much more difficult than detecting 
missing interactions between compounds and proteins in the learning 
set. 

Experiments on large-scale datasets 

We evaluated the performance on the full data consisting of 
216,121,626 compound-protein pairs, where the best parameter values 
for each method in the cross-validation experiments in the previous 
subsection were used. We examined the effect of the ratio of 
negative to positive compound-protein pairs on the performance. Note 
that the number of negative examples is much larger than that of 
positive examples in our dataset. We varied the number of negative 
examples in the cross-validation from the same number as the 
positive examples to the number of all possible negative examples. 

Figure 12 shows the memory usage of the four different methods. It 
was observed that the memory usage grew linearly as the number of 
compound-protein pairs increased in each method. In particular, both 
L1SVM and L2SVM required about 200 GB of memory. On the other hand, 
MH-L1SVM and MH-L2SVM took only about 30 GB of memory. There is 
little difference in memory usage between L1-regularization and 
L2-regularization. 

Table 1 shows the AUC scores in the pair-wise cross-validation. It 
was observed that the AUC scores of MH-L1SVM and MH-L2SVM were 
comparable to those of L1SVM and L2SVM, respectively. Table 2 shows 
the training time in the pair-wise cross-validation, where the 
training time includes the minwise hashing process and the execution 
time of all methods was capped at 24 hours. MH-L1SVM and MH-L2SVM 
are significantly faster than L1SVM and L2SVM, respectively. In 
particular, MH-L2SVM is about 10 times faster than L2SVM. L1SVM did 
not finish the computation for such a large number of 
compound-protein pairs within 24 hours. On the other hand, our 
MH-L1SVM finished the computation and took only 25,060 seconds on 
average. 

The same trends as in the pair-wise cross-validation were observed 
in the block-wise cross-validation as well (see Tables 3 and 4). 



Table 5 shows the AUC scores and training times when using all 
possible negative examples, where only one fold of the 5-fold 
cross-validation was performed on this dataset. On this extremely 
large data, L1SVM and L2SVM did not finish the computation within 24 
hours. On the other hand, MH-L1SVM and MH-L2SVM finished the 
computation, and the AUC scores were reasonable. The training times 
of MH-L1SVM and MH-L2SVM were 157,013 and 10,054 seconds, 
respectively. These results suggest the usefulness of our proposed 
methods in large-scale applications. 

Figure 13 shows the numbers of features extracted by 
MH-L1SVM and MH-L2SVM. The number of features 
extracted by MH-L1SVM is about three times smaller 
than the number extracted by MH-L2SVM. This 
result suggests that MH-L1SVM provides us with more 
selective features, which would help to make a biological 
interpretation of the functional associations between 
compound substructures and protein domains behind 
compound-protein interactions. 

Table 1 AUC score on pair-wise cross validation experiments 

Ratio  Number      MH-L1SVM              MH-L2SVM              L1SVM                 L2SVM
1      600,404     0.78 ± 2.31 × 10^[?]  0.79 ± 2.31 × 10^[?]  0.79 ± 3.22 × 10^[?]  0.80 ± 4.97 × 10^[?]
5      1,801,212   0.79 ± 7.23 × 10^[?]  0.80 ± 8.30 × 10^[?]  0.81 ± 2.04 × 10^[?]  0.81 ± 2.04 × 10^[?]
10     3,302,222   0.79 ± 1.84 × 10^[?]  0.80 ± 1.35 × 10^[?]  0.81 ± 5.34 × 10^[?]  0.81 ± 4.31 × 10^[?]
25     7,805,252   0.79 ± 2.89 × 10^[?]  0.80 ± 6.28 × 10^[?]  0.81 ± 9.87 × 10^[?]  0.81 ± 1.30 × 10^[?]
50     15,310,302  0.79 ± 3.21 × 10^[?]  0.81 ± 3.79 × 10^[?]  0.81 ± 3.40 × 10^[?]  0.81 ± 1.72 × 10^[?]
100    30,320,402  0.79 ± 2.38 × 10^[?]  0.81 ± 1.49 × 10^[?]  -                     0.81 ± 2.43 × 10^[?]
250    75,350,702  0.79 ± 2.91 × 10^[?]  0.81 ± 2.42 × 10^[?]  -                     0.81 ± 3.66 × 10^[?]

The number of negative examples is varied from the same number as the positive examples (ratio 1) to 250 times the number 
of positive examples (ratio 250). Number is the total number of examples, positive and negative. A dash marks an entry 
with no reported value; [?] marks an exponent that is illegible in the source. 
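
The Number column of the cross-validation tables is consistent with a fixed set of positive examples plus ratio-times as many negative examples. The short check below makes that arithmetic explicit. The figure of 300,202 positive pairs is an inference from the ratio-1 row (600,404 / 2) and is not stated in this section; the names `REPORTED_TOTALS` and `expected_total` are illustrative.

```python
# Totals from the "Number" column, keyed by the negative-to-positive ratio.
REPORTED_TOTALS = {
    1: 600_404,
    5: 1_801_212,
    10: 3_302_222,
    25: 7_805_252,
    50: 15_310_302,
    100: 30_320_402,
    250: 75_350_702,
}

# Assumption: at ratio 1 the examples are half positive and half negative.
N_POSITIVE = REPORTED_TOTALS[1] // 2


def expected_total(ratio, n_positive=N_POSITIVE):
    """Total examples when each positive is matched by `ratio` negatives."""
    return n_positive * (1 + ratio)


# Every reported total matches n_positive * (1 + ratio) exactly.
all_consistent = all(
    expected_total(ratio) == total for ratio, total in REPORTED_TOTALS.items()
)
```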



Table 2 Training time on pair-wise cross validation experiments 

Ratio  Number      MH-L1SVM         MH-L2SVM     L1SVM           L2SVM
1      600,404     29 ± 1           28 ± 1       188 ± 32        387 ± 63
5      1,801,212   172 ± 5          38 ± 2       1,655 ± 156     963 ± 81
10     3,302,222   448 ± 41         261 ± 7      1,261 ± 579     10,798 ± 1,981
25     7,805,252   1,808 ± 181      732 ± 17     20,067 ± 1,453  4,623 ± 782
50     15,310,302  1,140 ± 90       811 ± 41     58,045 ± 5,678  8,936 ± 1,412
100    30,320,402  7,601 ± 627      1,643 ± 50   > 24 hours      16,608 ± 2,732
250    75,350,702  25,060 ± 12,417  4,631 ± 795  > 24 hours      43,843 ± 7,200

The training time includes minwise hashing time. The number of negative examples is varied from the same number as the 
positive examples to 250 times the number of positive examples. 

Table 3 AUC scores on block-wise cross validation experiments 

Ratio  Number      MH-L1SVM     MH-L2SVM     L1SVM        L2SVM
1      600,404     0.66 ± 0.00  0.66 ± 0.01  0.65 ± 0.01  0.67 ± 0.01
5      1,801,212   0.66 ± 0.01  0.66 ± 0.01  0.66 ± 0.01  0.67 ± 0.01
10     3,302,222   0.66 ± 0.01  0.67 ± 0.01  0.66 ± 0.01  0.67 ± 0.01
25     7,805,252   0.66 ± 0.01  0.66 ± 0.01  0.65 ± 0.01  0.66 ± 0.01
50     15,310,302  0.66 ± 0.01  0.66 ± 0.01  0.65 ± 0.01  0.66 ± 0.01

The number of negative examples is varied from the same number as the positive examples to 50 times the number 
of positive examples. 

Table 4 Training times on block-wise cross validation experiments 

Ratio  Number      MH-L1SVM    MH-L2SVM  L1SVM        L2SVM
1      600,404     8 ± 0       7 ± 1     131 ± 7      117 ± 28
5      1,801,212   76 ± 9      31 ± 1    982 ± 184    252 ± 30
10     3,302,222   237 ± 23    55 ± 4    3,925 ± 92   475 ± 93
25     7,805,252   582 ± 79    107 ± 4   2,606 ± 265  322 ± 20
50     15,310,302  1,889 ± 82  243 ± 8   7,133 ± 664  729 ± 18

The training time includes minwise hashing time. The number of negative examples is varied from the same number as the 
positive examples to 50 times the number of positive examples. 

Table 5 AUC score and training time on the full data consisting of all 216,121,626 compound-protein pairs 

Measure              MH-L1SVM  MH-L2SVM  L1SVM       L2SVM
AUC score            0.79      0.81      -           -
Training time (sec)  157,013   10,054    > 24 hours  > 24 hours



Discussion and conclusion 

In this paper, we proposed a novel chemogenomic method 
to predict unknown compound-protein interactions on a 
large scale, which was made possible by using an improved 
minwise hashing algorithm to efficiently represent the 
fingerprints of compound-protein pairs. Interestingly, the 
linear SVM with the compact fingerprints generated by 
minwise hashing is able to simulate the nonlinear property 
of the kernel SVM (the state of the art). The originality of 
the proposed method lies in the scalable prediction of 
compound-protein interactions, in the computational 
efficiency, and in the interpretability of the predictive model. 
It should be pointed out that none of the previous methods 
was computationally feasible for the full data. The 
proposed method is expected to be useful for virtual screening 
of a large number of compounds against many protein 
targets. 
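
To make the role of minwise hashing concrete, the following is a minimal sketch of the standard technique: each binary fingerprint (a set of feature indices) is summarised by the minimum value of k hash functions, and the fraction of matching minima estimates the Jaccard resemblance between two fingerprints. This illustrates plain minwise hashing only, not the improved variant used in this study, and the hash construction (BLAKE2b over a seed and a feature index), the function names, and the toy sets are all assumptions made for illustration.

```python
import hashlib
import random


def _hash(seed, feature_id):
    """Deterministic 64-bit hash of a feature index under hash function `seed`."""
    data = f"{seed}:{feature_id}".encode("ascii")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(feature_ids, k):
    """Return k minima, one per hash function, for a non-empty feature set."""
    if not feature_ids:
        raise ValueError("feature set must be non-empty")
    return [min(_hash(seed, x) for x in feature_ids) for seed in range(k)]


def estimate_resemblance(sig_a, sig_b):
    """Fraction of positions where two signatures agree (Jaccard estimate)."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard resemblance |A and B| / |A or B| of two sets."""
    return len(a & b) / len(a | b)


# Toy fingerprints: 150 features each, 100 shared, 200 in the union.
rng = random.Random(0)
pool = rng.sample(range(1_000_000), 200)
set_a, set_b = set(pool[:150]), set(pool[50:])

K = 512  # number of hash functions; more gives a tighter estimate
sig_a = minhash_signature(set_a, K)
sig_b = minhash_signature(set_b, K)
exact = jaccard(set_a, set_b)
estimate = estimate_resemblance(sig_a, sig_b)
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

The signature length is fixed at k regardless of how many features the original fingerprint contains, which is what makes the representation compact enough for a linear classifier on very large datasets.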

The proposed method can be used as soon as compounds 
and proteins are represented by binary descriptors 
(chemical substructures and protein domains in this 
study). However, a limitation of the proposed method is 
that its performance depends on the definitions of the 
chemical substructures of compounds and the functional 
domains of proteins. The use of other descriptors (e.g., 
KlekotaRoth, ECFP6, Daylight, and Dragon) could improve 
the generalization properties of the method. Datasets, all 
results, and software are available at 
https://sites.google.com/site/interactminhash/. 
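
As an illustration of working with binary descriptors, the sketch below builds a sparse fingerprint for a compound-protein pair in which each feature corresponds to one (substructure, domain) combination, so that a selected feature can be read back as a substructure-domain association. This pairing scheme, the function names, and the toy indices are assumptions made for illustration; this section does not spell out the exact construction used in the study.

```python
def pair_fingerprint(substructure_ids, domain_ids, n_domains):
    """Sparse pair fingerprint as a set of feature indices.

    Feature index s * n_domains + d is "on" when the compound has
    substructure s and the protein has domain d (assumed encoding).
    """
    if n_domains <= 0:
        raise ValueError("n_domains must be positive")
    if any(d < 0 or d >= n_domains for d in domain_ids):
        raise ValueError("domain index out of range")
    if any(s < 0 for s in substructure_ids):
        raise ValueError("substructure index must be non-negative")
    return {s * n_domains + d for s in substructure_ids for d in domain_ids}


def decode_feature(feature_id, n_domains):
    """Map a pair-feature index back to (substructure, domain)."""
    return divmod(feature_id, n_domains)


# Toy usage: a compound with substructures 0 and 2, a protein with domain 1,
# in a hypothetical vocabulary of 3 domains.
toy_pair = pair_fingerprint({0, 2}, {1}, n_domains=3)
toy_decoded = sorted(decode_feature(f, 3) for f in toy_pair)
```

Because the result is again a set of indices, it can be passed directly to a set-based hashing step; swapping in a different descriptor family only changes which indices are produced.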